FLAME end-to-end: partitioner fix, execution metadata, setup callback#37
Merged
Conversation
materialized.data returns [[%Adbc.Column{}]] (list of batches),
not [%Adbc.Column{}]. find_column was iterating the outer list,
getting a batch (list) instead of individual columns, then crashing
on col.field.name.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adbc.Column structs don't implement Enumerable. Need to use Adbc.Column.to_list/1 to extract values from Arrow columns. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sequential place_child calls all landed on the same runner since there was no backpressure. Concurrent Task.async placement forces the pool to boot new runners when max_concurrency is 1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Concurrent placement caused GenServer call timeouts when multiple runners booted simultaneously. Sequential is correct — with max_concurrency: 1, each placed child holds its slot permanently, guaranteeing the next place_child goes to a new runner. Added :timeout option (default 120s) to handle slow Fly machine boots. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The pool needs time to process the slot replacement message (caller PID → child PID) before the next checkout attempt. Without this, sequential place_child calls can all see the runner at count 0 and land on the same runner. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The root cause of all workers landing on the same runner was FLAME's default max_concurrency of 100. DuckDB saturates cores internally, so one worker per machine is optimal. - Add max_concurrency: 1 to all FLAME pool examples - Fix LIVEBOOK_COOKIE: use Node.get_cookie() not System.get_env() - Add boot_timeout: 120_000 for Fly cold starts - Remove debug logging from spin_up - Document why max_concurrency: 1 matters Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- spin_up uses concurrent Task.async for parallel runner boot - status accepts workers list directly (remote PIDs can't use :pg) - Remove await_pg_registration (was using wrong :pg scope) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Execution metadata:
- Add `meta` field to %Dux{} struct
- Coordinator populates meta with distributed execution stats:
n_workers, n_nodes, nodes, merge_strategy, total_duration_ms
- kino_dux can read meta for rich rendering
Setup callback:
- Worker.setup/3 runs a function on the worker's node
- spin_up/2 accepts :setup option, runs on all workers after boot
- Enables per-worker S3 secrets, extension loading, etc.
Concurrent spin_up:
- Uses Task.async for parallel runner boot
- status/1 accepts workers list directly (remote :pg unreliable)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Concurrent Task.async + place_child causes internal FLAME GenServer
timeouts ({:placed_child, ...} 5s default). Sequential placement
works correctly — with max_concurrency: 1, each placement still
boots a separate runner.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
End-to-end FLAME distributed queries now work from Livebook on Fly.io. This PR bundles all fixes discovered during testing:
Partitioner fixes
materialized.datais[[%Adbc.Column{}]], not[%Adbc.Column{}])Adbc.Column.to_list/1instead ofEnum.to_list/1for Arrow columnsFLAME improvements
spin_up/2viaTask.asyncfor parallel runner boot:setupcallback option — runs on each worker after boot (e.g. S3 secrets, extensions)status/1to accept workers list directly (remote:pgunreliable across FLAME nodes)await_pg_registration(was using wrong:pgscope)Execution metadata
metafield to%Dux{}structmetawith:n_workers,n_nodes,nodes,merge_strategy,total_duration_msmetafor rich distributed stats renderingDocumentation
max_concurrency: 1in all FLAME pool examples (FLAME default is 100 — all workers land on same machine)LIVEBOOK_COOKIEviaNode.get_cookie()notSystem.get_env()boot_timeout: 120_000for Fly cold startsUsage
Test plan
🤖 Generated with Claude Code